The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency.
Comment: 27-page technical report
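The MDL criterion at the heart of the abstract can be sketched in a few lines. This is a toy illustration under assumptions of my own (a uniform character code for the lexicon, a greedy longest-match parse, a unigram code for the corpus), not the paper's algorithm or its hierarchical representation:

```python
import math
from collections import Counter

def description_length(corpus, lexicon):
    """Bits to write the lexicon (uniform character code) plus bits to
    write the corpus as a sequence of lexicon entries (unigram code)."""
    alphabet = set("".join(lexicon))
    lex_cost = sum(len(w) for w in lexicon) * math.log2(max(len(alphabet), 2))
    tokens = []
    for utterance in corpus:
        i = 0
        while i < len(utterance):
            # Greedy longest match; single characters are always in the lexicon.
            match = max((w for w in lexicon if utterance.startswith(w, i)),
                        key=len, default=utterance[i])
            tokens.append(match)
            i += len(match)
    counts = Counter(tokens)
    total = sum(counts.values())
    data_cost = -sum(c * math.log2(c / total) for c in counts.values())
    return lex_cost + data_cost

corpus = ["thedogran", "thecatran", "thedogsat"]
chars = sorted({c for u in corpus for c in u})
base = description_length(corpus, chars)                      # characters only
with_words = description_length(corpus, chars + ["the", "dog", "ran"])
```

Paying the up-front cost of spelling out genuine words is repaid many times over when encoding the corpus, so `with_words` comes out smaller than `base`; a search over candidate lexicons driven by this score is the essence of MDL segmentation.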
Unsupervised Language Acquisition
This thesis presents a computational theory of unsupervised language
acquisition, precisely defining procedures for learning language from ordinary
spoken or written utterances, with no explicit help from a teacher. The theory
is based heavily on concepts borrowed from machine learning and statistical
estimation. In particular, learning takes place by fitting a stochastic,
generative model of language to the evidence. Much of the thesis is devoted to
explaining conditions that must hold for this general learning strategy to
arrive at linguistically desirable grammars. The thesis introduces a variety of
technical innovations, among them a common representation for evidence and
grammars, and a learning strategy that separates the ``content'' of linguistic
parameters from their representation. Algorithms based on it suffer from few of
the search problems that have plagued other computational approaches to
language acquisition.
The theory has been tested on problems of learning vocabularies and grammars
from unsegmented text and continuous speech, and mappings between sound and
representations of meaning. It performs extremely well on various objective
criteria, acquiring knowledge that causes it to assign almost exactly the same
structure to utterances as humans do. This work has application to data
compression, language modeling, speech recognition, machine translation,
information retrieval, and other tasks that rely on either structural or
stochastic descriptions of language.
Comment: PhD thesis, 133 pages
Methods for Parallelizing Search Paths in Parsing
Many search problems are commonly solved with combinatorial algorithms that unnecessarily duplicate and serialize work at considerable computational expense. There are techniques available that can eliminate redundant computations and perform remaining operations concurrently, effectively reducing the branching factors of these algorithms. This thesis applies these techniques to the problem of parsing natural language. The result is an efficient programming language that can reduce some of the expense associated with principle-based parsing and other search problems. The language is used to implement various natural language parsers, and the improvements are compared to those that result from implementing more deterministic theories of language processing.
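The idea of eliminating redundant, serialized search work can be illustrated with memoization: sharing sub-results turns an exponential backtracking recognizer into a polynomial chart-style one. The toy CNF grammar below and the use of Python's `functools.lru_cache` are stand-ins of mine, not the thesis's programming language:

```python
from functools import lru_cache

GRAMMAR = {  # hypothetical toy grammar in Chomsky normal form
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP")],
}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "sees": "V"}

def recognizes(words, symbol="S"):
    words = tuple(words)

    @lru_cache(maxsize=None)  # shared table: each (sym, i, j) solved once
    def derives(sym, i, j):
        if j - i == 1:
            return LEXICON.get(words[i]) == sym
        return any(derives(left, i, k) and derives(right, k, j)
                   for left, right in GRAMMAR.get(sym, ())
                   for k in range(i + 1, j))

    return derives(symbol, 0, len(words))
```

Because every `(sym, i, j)` query is answered once and reused, the duplicated work of naive backtracking disappears; the independent table entries are also natural units for concurrent evaluation.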
On The Unsupervised Induction Of Phrase-Structure Grammars
This paper examines why some previous approaches have failed to acquire desired grammars without supervision, and proposes that with a different conception of phrase structure, supervision might not be necessary. In particular, it describes in detail some reasons why SCFGs are poor models to use for learning human language, especially when combined with the inside-outside algorithm. Following up on these arguments, it proposes that head-driven grammatical formalisms like link grammars (Sleator and Temperley, 1991) are better suited to the task, and introduces a framework for CFG induction that sidesteps many of the search problems that previous schemes have had. In the end, we hope the analysis presented here convinces others to look carefully at their representations and search strategies before blindly applying them to the language learning task. We start the discussion by examining the differences between the linguistic and statistical motivations for phrase structure; this frames our subsequent analysis. Then we introduce a simple extension to stochastic context-free grammars, and use this new class of language models in two experiments that pinpoint specific problems with both SCFGs and the search strategies commonly applied to them. Finally, we explore fixes to these problems.
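For readers unfamiliar with the machinery being critiqued, here is a minimal inside-probability computation for a CNF stochastic context-free grammar; the grammar and probabilities are invented, and the full inside-outside procedure adds an outside pass and rule reestimation on top of these quantities:

```python
from collections import defaultdict

RULES = {  # (lhs, rhs) -> probability; rhs is a (B, C) pair or a terminal
    ("S", ("A", "A")): 1.0,
    ("A", "a"): 0.6,
    ("A", ("A", "A")): 0.4,
}

def inside(words):
    """beta[(sym, i, j)] = probability that sym derives words[i:j]."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):
        for (lhs, rhs), p in RULES.items():
            if rhs == w:
                beta[(lhs, i, i + 1)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for (lhs, rhs), p in RULES.items():
                if isinstance(rhs, tuple):
                    for k in range(i + 1, j):
                        beta[(lhs, i, j)] += (p * beta[(rhs[0], i, k)]
                                                * beta[(rhs[1], k, j)])
    return beta[("S", 0, n)]
```

The paper's point is that maximizing this string likelihood says nothing about whether the trees that produce it look like linguistic constituents, which is one reason inside-outside training can converge on unhelpful bracketings.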
Linguistic Structure as Composition and Perturbation
This paper discusses the problem of learning language from unprocessed text and speech signals, concentrating on the problem of learning a lexicon. In particular, it argues for a representation of language in which linguistic parameters like words are built by perturbing a composition of existing parameters. The power of the representation is demonstrated by several examples in text segmentation and compression, acquisition of a lexicon from raw speech, and the acquisition of mappings between text and artificial representations of meaning.
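One way to picture "parameters built by composing existing parameters" is a lexicon in which a new entry may be spelled either as raw characters or as references to entries already in the lexicon. The costs below are arbitrary stand-ins of mine, and the sketch omits the perturbation machinery entirely:

```python
def entry_cost(spelling, lexicon, char_cost=5.0, ref_cost=3.0):
    """Cheapest way to write one entry: dynamic program over prefixes,
    where each step spells a raw character or references an entry."""
    best = [0.0] + [float("inf")] * len(spelling)
    for j in range(1, len(spelling) + 1):
        best[j] = best[j - 1] + char_cost          # spell one raw character
        for w in lexicon:                           # or reuse an existing entry
            if spelling.endswith(w, 0, j) and best[j - len(w)] + ref_cost < best[j]:
                best[j] = best[j - len(w)] + ref_cost
    return best[-1]

lex = ["the", "dog", "house"]
raw = entry_cost("thedoghouse", [])       # characters only: 11 * 5.0 bits
composed = entry_cost("thedoghouse", lex)  # three references: 3 * 3.0 bits
```

Compositional spellings make frequent combinations cheap, so the description-length pressure naturally reuses structure the learner has already paid for.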
The Acquisition of a Lexicon from Paired Phoneme Sequences and Semantic Representations
We present an algorithm that acquires words (pairings of phonological forms and semantic representations) from larger utterances of unsegmented phoneme sequences and semantic representations. The algorithm maintains from utterance to utterance only a single coherent dictionary, and learns in the presence of homonymy, synonymy, and noise. Test results over a corpus of utterances generated from the CHILDES database of mother-child interactions are presented.
1 Introduction
This paper is concerned with the machine learning of a lexicon from utterances that consist of an unsegmented phoneme sequence paired with a semantic representation of what those phonemes collectively mean. The problem is modeled after the environment that a child learns in, presented with a continuous speech signal and potentially hypothesizing a meaning for that signal based upon visual stimuli. We radically simplify the problem the child encounters for the computer by pre-digesting the speech stream into a sequ..
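A drastically simplified cross-situational learner conveys the flavor of the task. The real algorithm maintains a single scored dictionary and copes with homonymy, synonymy, and noise; this sketch, with invented data, does none of that and simply keeps, per meaning, the longest form that co-occurs with it perfectly:

```python
def learn_words(utterances):
    """utterances: list of (phoneme_string, set_of_meaning_symbols).
    For each meaning, keep the longest substring present in every
    utterance carrying that meaning and absent from all others."""
    words = {}
    for m in {m for _, ms in utterances for m in ms}:
        with_m = [p for p, ms in utterances if m in ms]
        without_m = [p for p, ms in utterances if m not in ms]
        first = with_m[0]
        cands = {first[i:j] for i in range(len(first))
                 for j in range(i + 1, len(first) + 1)}
        cands = {f for f in cands
                 if all(f in p for p in with_m)
                 and not any(f in p for p in without_m)}
        if cands:
            # Meanings seen only once overcommit to the whole utterance --
            # one of the noise problems a real learner must handle.
            words[m] = max(cands, key=len)
    return words

utterances = [
    ("dogruns", {"DOG", "RUN"}),
    ("dogsits", {"DOG", "SIT"}),
    ("catruns", {"CAT", "RUN"}),
]
words = learn_words(utterances)
```

Even three utterances suffice to isolate the recurring forms, while the singleton meanings show why segmentation and word learning must be solved jointly rather than meaning by meaning.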
Joseph Conrad's Letters to R. B. Cunninghame Graham [book].
These letters display Conrad's friendship with an aristocratic Scottish adventurer who was a pioneering socialist. The elaborate annotations set the friendship in cultural and political contexts, and show how Graham influenced Conrad's writings.
Parsing The LOB Corpus
This paper presents a rapid and robust parsing system currently used to learn from large bodies of unedited text. The system contains a multivalued part-of-speech disambiguator and a novel parser employing bottom-up recognition to find the constituent phrases of larger structures that might be too difficult to analyze. The results of applying the disambiguator and parser to large sections of the Lancaster-Oslo/Bergen corpus are presented.
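The tag-then-chunk pipeline can be caricatured as follows; the tag names and the single determiner-adjective-noun rule are invented for illustration and are not the system's tag set or grammar:

```python
def chunk_nps(tagged):
    """tagged: list of (word, tag) pairs. Returns a token list in which
    a recognized noun phrase becomes a ('NP', [words]) chunk and all
    other material passes through unanalyzed."""
    out, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "DET":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "ADJ":
                j += 1                      # absorb any adjective run
            if j < len(tagged) and tagged[j][1] == "N":
                out.append(("NP", [w for w, _ in tagged[i:j + 1]]))
                i = j + 1
                continue
        out.append(tagged[i])               # leave hard material alone
        i += 1
    return out

result = chunk_nps([("the", "DET"), ("old", "ADJ"),
                    ("dog", "N"), ("slept", "V")])
```

Leaving unrecognized material untouched rather than forcing a full parse is what makes this style of bottom-up recognition robust on large, unedited corpora.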